Besides the text content, documents and their associated words usually come with rich sets of meta information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word co-occurrence information in the training data is insufficient. In this paper, we present a topic model, called MetaLDA, which is able to leverage either document or word meta information, or both of them jointly. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta information. Extensive experiments on several real-world datasets demonstrate that our model achieves comparable or improved performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, compared with other models using meta information, our model runs significantly faster.
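To make the general idea concrete, below is a minimal, hypothetical sketch (not the paper's exact MetaLDA construction or its augmented sampler) of how document meta information might enter a topic model: binary document labels modulate a per-document asymmetric Dirichlet prior over topics inside a standard collapsed Gibbs sampler. The toy corpus, the label-to-topic weights `lam`, and all other names are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: document meta information (binary labels) shapes the
# per-document Dirichlet prior over topics, so documents sharing labels
# share prior mass. This is an illustration, not the paper's MetaLDA model.

rng = np.random.default_rng(0)

V, K, L = 50, 5, 3                                      # vocab size, topics, labels
docs = [rng.integers(0, V, size=20) for _ in range(10)] # toy corpus (word ids)
labels = rng.integers(0, 2, size=(len(docs), L))        # binary doc meta info

beta = 0.01                                  # symmetric topic-word prior
lam = rng.gamma(1.0, 1.0, size=(L, K))       # assumed label-to-topic weights
# Per-document asymmetric prior: product of weights over a doc's active labels
alpha = np.exp(labels @ np.log(lam))         # shape (D, K)

# Count tables for collapsed Gibbs sampling
ndk = np.zeros((len(docs), K))               # doc-topic counts
nkv = np.zeros((K, V))                       # topic-word counts
nk = np.zeros(K)                             # topic totals
z = [rng.integers(0, K, size=len(d)) for d in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1

for it in range(100):                        # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] -= 1; nkv[k, w] -= 1; nk[k] -= 1
            # Conditional: meta-informed doc prior x collapsed topic-word term
            p = (ndk[d] + alpha[d]) * (nkv[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1
```

Documents with identical labels share the same prior `alpha[d]`, which illustrates how label-level information can compensate for sparse per-document word counts.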